Integrate Sonic MoE finalize changes #4663

Open
A-nnonymous wants to merge 4 commits into PaddlePaddle:develop from A-nnonymous:sonic_moe_finalize

Conversation

@A-nnonymous

Summary

Notes

  • Prepared to demonstrate the Sonic MoE finalize work and for follow-up integration into develop.

@Paddle-CI-Bot

Paddle-CI-Bot commented Jun 13, 2026

PaddleFormers Log Analysis

Run #27486951139 · Attempt 1

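Log Analysis Report

| Pipeline | Issue label | Suggested fix | Log snippet |
| --- | --- | --- | --- |
| Fleet Model Test - Integration test (H20, single card) | CUDA device not initialized correctly | Unrelated to this PR. The H20 runner failed to initialize its device; CI maintainers should check the runner's GPU driver status, then rerun. | Error code |
| Fleet Model Test - Integration test (H20, multi-card) | CUDA device not initialized correctly | Unrelated to this PR. The H20 runner failed to initialize its device; CI maintainers should check the runner's GPU driver status, then rerun. | Error code |
| Unittest GPU CI - upload-coverage | GLIBC version too old | Unrelated to this PR. The system GLIBC on runner_04 does not meet the node24 requirement (≥2.27 needed); CI maintainers should upgrade the runner OS or downgrade the actions-runner node version. | Error code |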

Failed test cases:

Fleet Model Test (H20, multi-card):
  - GLM4.5 pre-train
  - GLM4.5 sft
  - GLM4.5 sft cp
  - GLM4.5 lora
  - GLM4.5 dpo
  - GLM4.5 dpo_lora
  - GLM4.5 pre-train (EP4)
  - GLM4.5 pre-train (FP8)
  - GLM4.5 pre-train (Grouped GEMM)
  - Qwen pre-train / sft / lora / vl sft / vl lora / vl moe
  - Qwen3-vl-8k-fsdp

Fleet Model Test (H20, single card):
  - Integration test (GLM4.5 single-card)
  - Integration test (Qwen3-30B-A3B single-card)
  - Qwen3-vl-8k-single-card

Unittest GPU CI:
  - upload-coverage (job 81244365118)

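Root cause analysis:
All failures on the H20 runners share one root cause: paddlefleet_ops/__init__.py calls paddle.cuda.get_device_capability() at import time, but the runner's CUDA device was not initialized correctly (Paddle reports Place(cpu)). The call raises ValueError, so paddleformers-cli crashes on startup and none of the tasks run. The upload-coverage failure is separate: the GLIBC version on runner_04 (< 2.25) does not satisfy the binary dependencies of GitHub Actions node24. Neither failure is related to the PR code.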
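The import-time failure described above can be illustrated with a small sketch. It is not the code in paddlefleet_ops; it only shows one way a package could defer a device-capability query until first use and tolerate a CPU-only process. The names `safe_device_capability`, `LazyCapability`, and `fake_cpu_only_query` are hypothetical, and the query is passed in as a callable so the example runs without Paddle or a GPU.

```python
from typing import Callable, Optional, Tuple

Capability = Tuple[int, int]  # (major, minor) compute capability


def safe_device_capability(query: Callable[[], Capability]) -> Optional[Capability]:
    """Run a capability query and return None instead of raising on failure.

    Assumption: the failure surfaces as ValueError, as the CI report states.
    """
    try:
        return query()
    except ValueError:
        return None


class LazyCapability:
    """Defer the query until first use so that importing a module never crashes."""

    def __init__(self, query: Callable[[], Capability]) -> None:
        self._query = query
        self._resolved = False
        self._value: Optional[Capability] = None

    def get(self) -> Optional[Capability]:
        # Resolve once and cache the result, including a None result.
        if not self._resolved:
            self._value = safe_device_capability(self._query)
            self._resolved = True
        return self._value


def fake_cpu_only_query() -> Capability:
    """Hypothetical stand-in for a query made while the place is Place(cpu)."""
    raise ValueError("device is Place(cpu), no CUDA capability available")


if __name__ == "__main__":
    print(LazyCapability(fake_cpu_only_query).get())
    print(LazyCapability(lambda: (9, 0)).get())
```

In a real package the callable would wrap whichever Paddle call the module uses. Whether a None result should disable GPU-only kernels or raise a clearer error later is a design choice this sketch leaves open.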

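Suggested fixes:

  1. Both H20 runner jobs failed because of the machine environment. Rerun them directly; CI maintainers should also check the GPU driver and CUDA initialization state on the H20 runner (see earlier occurrences of the same error pattern).
  2. The upload-coverage job failure does not affect the main test result (the unittest-gpu-ci job itself passed). The GLIBC/node version problem on runner_04 is for CI maintainers to handle; no action is needed from the PR author.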
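For the GLIBC item, a runner health check could compare the installed C library version against a required minimum before a job starts. The sketch below is an illustration only: the helper names are hypothetical, and the default minimum of 2.27 is taken from the report's table rather than from Node.js documentation, so treat it as an adjustable assumption.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '2.27' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def glibc_meets_minimum(found: str, minimum: str = "2.27") -> bool:
    """Return True if the found version is at least the minimum.

    An empty string (no glibc detected) is treated as not meeting it.
    """
    if not found.strip():
        return False
    # Tuples compare numerically, so (2, 9) < (2, 27) as intended.
    return parse_version(found) >= parse_version(minimum)


if __name__ == "__main__":
    import platform

    # platform.libc_ver() returns a (lib, version) pair; both may be empty
    # strings on systems where the C library cannot be determined.
    lib, version = platform.libc_ver()
    print(lib, version, glibc_meets_minimum(version))
```

Comparing parsed integer tuples rather than raw strings avoids ordering "2.9" after "2.27".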

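🔍 Accuracy record: click the 😊 icon at the bottom of this comment and choose 👍 (accurate) or 👎 (inaccurate); the response is recorded automatically in the CI monitoring system.

🔄 Updated automatically after each re-run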

Run repo pre-commit hooks over the SonicMoE FP8 storage-release changes:
black reformatting in trainer_callback.py and removal of the now-unused
InterleaveGateUpCallback import in trainer.py (flagged by flake8 F401).
No behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@A-nnonymous

Author

/re-run all-failed

4 participants